MIXED-RATES ASYMPTOTICS

By Peter Radchenko 

University of Southern California 

A general method is presented for deriving the limiting behavior 
of estimators that are defined as the values of parameters optimizing 
an empirical criterion function. The asymptotic behavior of such esti- 
mators is typically deduced from uniform limit theorems for rescaled 
and reparametrized criterion functions. The new method can handle 
cases where the standard approach does not yield the complete lim- 
iting behavior of the estimator. The asymptotic analysis depends on 
a decomposition of criterion functions into sums of components with 
different rescalings. The method is explained by examples from Lasso-type
estimation, k-means clustering, Shorth estimation and partial
linear models.

1. Introduction. Consider an estimator $(a_n, b_n)$ that in some sense optimizes
a random criterion function $G_n(a,b)$ over an open subset of $\mathbb{R}^{d_1}\times\mathbb{R}^{d_2}$.
Two types of mixed-rates asymptotic behavior can occur and often occur
simultaneously. First, the components $a_n$ and $b_n$ of the estimator may converge
at different rates. Second, the criterion function itself may have important
components settling down at different rates. The new method presented
in this paper can handle both types of mixed-rates behavior.

Deriving the asymptotics of an estimator can be viewed as a three step 
procedure: proving consistency, establishing the rate of convergence and de- 
riving the limiting distribution. This paper concentrates only on the last two 
steps. The limiting distribution is typically derived via a uniform limit the- 
orem for the rescaled and reparametrized criterion functions. Suppose that 
the rates of convergence for the two components of the estimator have been 
established: $q_n^{-1}\|a_n - a_0\| \vee r_n^{-1}\|b_n - b_0\| = O_p(1)$ for some fixed parameter
value $(a_0, b_0)$. Consider localized criterion functions of the form

$$H_n(s,t) := G_n(a_0 + q_n s,\; b_0 + r_n t) - G_n(a_0, b_0).$$
If, after appropriate rescaling, random functions $H_n(s,t)$ settle down to a
"nice" stochastic process, the convergence in distribution of vectors
$(s_n, t_n) := (q_n^{-1}[a_n - a_0],\, r_n^{-1}[b_n - b_0])$ to the corresponding optimizer of the
limit process may follow from a continuous mapping type of argument. Theorem 3.2.2 of
van der Vaart and Wellner [21] makes this argument precise for estimators
defined by maximization. The above approach is standard when the rates
$r_n$ and $q_n$ are the same, and it can work in some mixed-rates cases, such as
the change-point problem (see, e.g., the section on nonregular examples in
Kosorok [8]). Other mixed-rates examples where this argument succeeds can
be found in Rotnitzky, Cox, Bottai and Robins [16], Pollard and Radchenko
[12] and Andrews [1].

Many mixed-rates problems cannot be completely handled by the above 
approach. In the examples considered in this paper, the localized criterion 
function has the form 

$$H_n(s,t) = \alpha_n f_n(s) + \beta_n g_n(s,t),$$

where $\beta_n = o(\alpha_n)$, the random function $f_n(s)$ settles down to a stochastic
process $f(s)$, and $g_n$ is stochastically bounded. Because the limit of
$\alpha_n^{-1}H_n(s,t)$ is a stochastic process indexed only by $s$, the standard approach
fails to establish the limiting distribution of the component $t_n$. However,
if the random function $g_n(s,t)$ settles down to a stochastic process $g(s,t)$, a
two-step continuous mapping argument can be used to establish the distributional
limit of the vector $(s_n, t_n)$. This general idea is made rigorous by
Theorem 1 in Section 2.

Another challenging problem is deriving the correct rates of convergence
for the two components of the estimator. Standard methods represent the
centered criterion function $G_n(a,b) - G_n(a_0,b_0)$ as a sum of a positive
deterministic function and a random one, whose rates of growth around the
value $(a_0,b_0)$ can be controlled (the deterministic function is typically
approximated by a quadratic, and the random function is often approximately
linear). Balancing out the two terms produces the rate of convergence: see,
for example, Theorem 3.2.5 and Theorem 3.2.16 in van der Vaart and Wellner
[21]. When $a_n$ and $b_n$ converge at different rates, this approach yields the
"correct" rate only for the slower converging component. A reparametrization
of the problem can sometimes be applied beforehand to sidestep this issue
(for interesting examples, see the references at the end of the paragraph
on the standard method for deriving the limiting distribution). Unfortunately,
such a trick is not available in general, and a more careful treatment
of the criterion function is required. To derive the rate for the faster
converging component, say, $b_n$, Theorem 2 in Section 3 balances out the terms in
a similar, but typically more complicated, representation for the function
$b \mapsto [G_n(a_n,b) - G_n(a_n,b_0)]$.
Section 4 is devoted to mixed-rates problems that arise in M-estimation.
Consider a collection of functions $g_\theta(x)$ and an empirical measure $P_n$,
corresponding to independent observations coming from a distribution $P$. Define
the estimator $\theta_n$ as the minimizer of the criterion function $G_n(\theta) = P_n g_\theta$,
and suppose that function $G(\theta) = \int g_\theta\,dP$ is minimized by $\theta_0$. The stochastic
bound $\|\theta_n - \theta_0\| = o_p(1)$ usually follows from a uniform law of large numbers,
and the central limit theorem for the estimator is typically derived from a
quadratic approximation of the form

$$G_n(\theta) - G_n(\theta_0) \approx \tfrac{1}{2}(\theta - \theta_0)'G''(\theta_0)(\theta - \theta_0) + n^{-1/2}(\theta - \theta_0)'Z_n,$$

under the regularity assumption that the matrix $G''(\theta_0)$ is positive definite.
If this regularity assumption breaks down and $G''(\theta_0)$ is singular, the
approximation has to be carried out to higher order terms, which typically
leads to mixed-rates situations that standard methods cannot handle.
Theorem 3 covers exactly such cases. The form of the approximation to function
$G(\theta)$ near $\theta_0$ determines the rates of convergence and the main features of
the limiting behavior of the components of the estimator. Various remainder
terms are handled by simple conditions imposed on functions $g_\theta$.
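
As a brief worked step (a sketch, assuming the quadratic term carries the usual Taylor factor $1/2$ and that $G''(\theta_0)$ is invertible), minimizing the quadratic approximation over $h = \theta - \theta_0$ shows where the standard root-$n$ rate comes from and why a singular $G''(\theta_0)$ breaks the argument:

```latex
\begin{align*}
  \hat h
  &= \arg\min_{h}\Big\{ \tfrac12\, h' G''(\theta_0)\, h + n^{-1/2}\, h' Z_n \Big\}
   = -\,n^{-1/2}\,[G''(\theta_0)]^{-1} Z_n , \\
  n^{1/2}(\theta_n - \theta_0)
  &\approx -\,[G''(\theta_0)]^{-1} Z_n .
\end{align*}
% If G''(\theta_0) is singular, the inverse does not exist: along the null
% directions the quadratic term vanishes and higher order terms of G must
% balance the linear random term, which produces slower rates there.
```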

Mixed-rates behavior naturally arises in the estimation of semiparametric 
models. Most of the results in this paper do not directly apply to such prob- 
lems, but, as the example in Section 8 demonstrates, some of the methods 
and ideas can be carried over. 

For the simplicity of the presentation, the estimators and the criterion 
functions considered in this paper have at most two components converging 
at different rates. All the results can be easily extended to cover cases of 
more than two mixed-rates components. 

This paper is organized as follows. Sections 2, 3 and 4 contain the general 
mixed-rates asymptotics results, namely, the limiting distribution theorem, 
the rates of convergence theorem and the M-estimation theorem. Proofs of 
these theorems are confined to Section 9. Sections 5, 6 and 7 contain applications
of the general results to particular problems in Lasso-type estimation,
shorth estimation and k-means clustering. Section 8 discusses a semiparametric
example.

The abbreviation $Qf = \int f\,dQ$ is used throughout the paper for a given
measurable function $f$ and a signed measure $Q$. In particular, given independent
observations $X_i$ coming from a distribution $P$, let $P_n f$ denote
$\sum_{i\le n} f(X_i)/n$ and define the empirical process $\nu_n$ on a class of functions $f$
by

$$f \mapsto \nu_n f = n^{1/2}(P_n - P)f = n^{-1/2}\sum_{i=1}^{n}[f(X_i) - Pf].$$

Write $\|\cdot\|_2$ for the $L_2(P)$ norm and say that a function $f$ is square-integrable
if $\|f\|_2 < \infty$. Interpret $f(\theta) \gtrsim g(\theta)$ to mean that there exists a positive
constant $c_0$ such that $f(\theta) \ge c_0\,g(\theta)$ for all $\theta$ in a sufficiently small neighborhood
of the origin. Analogously, interpret $\alpha_n \gtrsim \beta_n$ to mean $\alpha_n \ge c_0\beta_n$ for all
sufficiently large $n$.
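
The empirical process notation can be made concrete with a small numerical sketch. The function names, the uniform sample and the indicator test function below are illustrative assumptions, not part of the paper; the population value $Pf$ has to be supplied by the caller.

```python
import math
import random


def empirical_mean(f, xs):
    """P_n f: the average of f over the sample."""
    return sum(f(x) for x in xs) / len(xs)


def empirical_process(f, xs, pf):
    """nu_n f = n^{1/2} (P_n f - P f), with the population value P f given as pf."""
    return math.sqrt(len(xs)) * (empirical_mean(f, xs) - pf)


def empirical_process_sum_form(f, xs, pf):
    """The equivalent form n^{-1/2} * sum_i [f(X_i) - P f]."""
    return sum(f(x) - pf for x in xs) / math.sqrt(len(xs))


# Hypothetical example: Uniform(0, 1) data and f = 1{x <= 1/2}, so P f = 1/2.
rng = random.Random(0)
sample = [rng.uniform(0.0, 1.0) for _ in range(1000)]
indicator = lambda x: 1.0 if x <= 0.5 else 0.0
value = empirical_process(indicator, sample, 0.5)
```

Both forms agree up to floating-point error, which is the identity stated in the display above.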

2. Limiting distribution. Let the estimator $(a_n,b_n)$ converge in probability
to a fixed parameter value $(a_0,b_0)$. Suppose that the rates of convergence
$q_n$ and $r_n$ have been established for the components $a_n$ and $b_n$, respectively.
Vector $(s_n,t_n) := (q_n^{-1}[a_n - a_0],\, r_n^{-1}[b_n - b_0])$ optimizes the localized criterion
function $H_n(s,t)$ and satisfies the tightness condition $\|(s_n,t_n)\| = O_p^*(1)$.
Focus on deriving the limiting distribution of $(s_n,t_n)$ when it is defined by
minimization.

To avoid some measurability issues by allowing nonmeasurable maps, convergence
in distribution (denoted by "$\rightsquigarrow$") is understood in the sense of
Hoffmann-Jørgensen. An exposition of this general concept can be found
in the monographs of Dudley [3] and van der Vaart and Wellner [21]. Let
$B_{\mathrm{loc}}(\mathbb{R}^d)$ be the space of all locally bounded real functions on $\mathbb{R}^d$.
Convergence of the random processes considered in the examples of this paper is
handled by equipping $B_{\mathrm{loc}}(\mathbb{R}^d)$ with the metric $\rho$ for the topology of uniform
convergence on compacta:

$$\rho(g,h) = \sum_{k=1}^{\infty} 2^{-k}\min[1, \rho_k(g,h)] \quad\text{where}\quad \rho_k(g,h) = \sup_{\|t\|\le k}|g(t) - h(t)|.$$

In Theorem 1, convergence of the components of the criterion function
should be understood with respect to this metric. The following continuity
property of the argmin functional with respect to $\rho$ simplifies the statement
of the theorem. Let $x^*$ be the clean minimum of a function $h$ in the
sense that the strict inequality $h(x^*) < \inf_{\epsilon<|x-x^*|<r} h(x)$ is satisfied for all
positive $r$ and $\epsilon$. Then

$$\rho(h_n, h) \to 0 \quad\text{implies}\quad \arg\min_{|x-x^*|<r} h_n(x) \to x^* \quad\text{for each } r > 0. \tag{1}$$

Note that the unique minimum of a continuous function is also its clean
minimum over each large enough ball. In fact, lower semicontinuity of the
function is sufficient. The proof of Theorem 1 would remain valid if $B_{\mathrm{loc}}(\mathbb{R}^d)$
were equipped with a different metric $d$, as long as assumption (1) were
imposed explicitly and formulated in terms of $d$.

The following result is stated in the cleanest form that covers the examples
considered in the paper; some of its conditions can be relaxed.
See the Remark following the theorem for an alternative to the continuity
assumption placed on the sample paths of the limit process $(f,g)$. Also note
that the sample path properties required of the limit process need to hold
only almost surely.

Theorem 1. Let $H_n$ be random criterion functions on $\mathbb{R}^{d_1}\times\mathbb{R}^{d_2}$ and let
$(s_n,t_n)$ be random vectors in $\mathbb{R}^{d_1}\times\mathbb{R}^{d_2}$. Suppose that the following conditions
are satisfied:

(i) $H_n(s,t) = \alpha_n f_n(s) + \beta_n g_n(s,t)$, where $f_n$ and $g_n$ are random functions
on $\mathbb{R}^{d_1}$ and $\mathbb{R}^{d_1}\times\mathbb{R}^{d_2}$ respectively, while $\alpha_n$ and $\beta_n$ are positive numbers
with $\beta_n = o(\alpha_n)$;

(ii) $(f_n, g_n) \rightsquigarrow (f,g)$ and the limit process has continuous sample paths;

(iii) $H_n(s_n,t_n) \le \inf_{s,t} H_n(s,t) + o_p^*(\beta_n)$ and

(iv) $\|(s_n,t_n)\| = O_p^*(1)$.

Assume that the sample paths of $f(\cdot)$ possess a unique minimum at a (random)
point $s^*$ and the sample paths of $g(s^*,\cdot)$ possess a unique minimum at
$t^*$. Then $(s_n,t_n) \rightsquigarrow (s^*,t^*)$.
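
The two-step structure of the limit in Theorem 1 can be illustrated numerically. The sketch below is a deterministic toy example, not taken from the paper: the functions `f` and `g`, the grids and the weights are all hypothetical choices. It compares the joint grid minimizer of $\alpha_n f(s) + \beta_n g(s,t)$ with the two-step solution $s^* = \arg\min f$, $t^* = \arg\min g(s^*,\cdot)$ as $\beta_n/\alpha_n$ shrinks.

```python
def make_grid(lo, hi, step):
    """Equally spaced grid from lo to hi (inclusive up to rounding)."""
    count = int(round((hi - lo) / step))
    return [lo + i * step for i in range(count + 1)]


S_GRID = make_grid(0.0, 2.0, 0.01)
T_GRID = make_grid(0.0, 4.0, 0.01)


def f(s):
    # Leading component: unique minimum at s = 1.
    return (s - 1.0) ** 2


def g(s, t):
    # Second component: for fixed s, unique minimum in t at t = 2s.
    return (t - 2.0 * s) ** 2 + 3.0 * s


def joint_argmin(alpha_n, beta_n):
    """Grid minimizer of H(s, t) = alpha_n * f(s) + beta_n * g(s, t)."""
    best, best_val = None, float("inf")
    for s in S_GRID:
        for t in T_GRID:
            val = alpha_n * f(s) + beta_n * g(s, t)
            if val < best_val:
                best_val, best = val, (s, t)
    return best


def two_step_argmin():
    """First minimize f over s, then minimize g(s*, .) over t."""
    s_star = min(S_GRID, key=f)
    t_star = min(T_GRID, key=lambda t: g(s_star, t))
    return s_star, t_star
```

In this toy example the exact joint minimizer is $s = 1 - 3\beta_n/(2\alpha_n)$, $t = 2s$, so it approaches the two-step solution $(1, 2)$ as $\beta_n/\alpha_n \to 0$; comparing `joint_argmin(1.0, 0.1)` and `joint_argmin(1.0, 0.001)` with `two_step_argmin()` shows the effect.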

Remark. The assumptions on the sample paths of the limit process
$(f,g)$ can be relaxed as follows. Assume that $s^*$ and $t^*$ are measurable random
points such that for almost all sample paths of the limit process:

(a) $s^*$ is the "clean" minimum of $f(\cdot)$,

(b) $t^*$ is the "clean" minimum of $g(s^*,\cdot)$ and

(c) for each ball $B$, the set of functions $\{g(\cdot,t) : t \in B\}$ is equicontinuous.

Theorem 1 can be generalized to cover cases where the optimizer is not
defined by minimization or maximization. Suppose that vectors $(s_n,t_n)$ satisfy
equalities $s_n = \Psi[H_n(\cdot,t_n)]$ and $t_n = \Phi[H_n(s_n,\cdot)]$ for certain maps $\Psi$
and $\Phi$. Assume that these maps are invariant to multiplications by positive
constants and that $\Phi$ is also invariant to translations. If, in addition, each
map satisfies assumption (1) with the proper replacement for the argmin,
the proof of Theorem 1 still goes through. For a rigorous account of this
fact, see Theorem 1 in Radchenko [15].

3. Rates of convergence. Consider two-component estimators $(a_n,b_n)$
that are defined by minimizing random criterion functions $G_n(a,b)$. The
following lemma uses an approximation to the criterion function to establish
the rate of convergence of the slower converging component $a_n$ and makes an
initial guess at the rate of convergence of the component $b_n$. This guess is not
quite correct, but it provides an improvement over existing results, which
establish one convergence rate for the whole long vector $(a_n,b_n)$. Lemma
1 requires a particular representation for the criterion function. In many
standard asymptotic problems, this representation is satisfied with the term
$M_n(a,b)$ bounded below by a nonsingular quadratic, and the term $N_n(a,b)$ of
the order $O_p(n^{-1/2}\|(a,b)\|)$, which yields the usual $n^{-1/2}$ rate of convergence.
The lemma handles cases that are more general.

Lemma 1. Suppose that inequalities $G_n(a_n,b_n) \le G_n(0,0)$ hold together
with the stochastic bound $\|(a_n,b_n)\| = o_p^*(1)$. Let $\alpha$ and $\beta$ be positive numbers
satisfying $\alpha \ge \beta$, and let $\{\gamma_1,\dots,\gamma_p,\eta_1,\dots,\eta_p\}$ be a collection of nonnegative
numbers satisfying $\gamma_i < \alpha$ for all $i \in \{1,\dots,p\}$. Suppose that criterion
functions $G_n$ satisfy a representation

$$G_n(a,b) - G_n(0,0) = M_n(a,b) - N_n(a,b),$$

such that

$$M_n(a_n,b_n) \gtrsim \|a_n\|^{\alpha} + \|b_n\|^{\beta} \quad\text{with inner probability tending to one, and}$$

$$[N_n(a_n,b_n)]_+ = O_p^*\Big(\sum_{i\le p} n^{-\eta_i}\|(a_n,b_n)\|^{\gamma_i}\Big).$$

Define $\tau_a = \min_{i\le p}\big(\frac{\eta_i}{\alpha-\gamma_i}\big)$. Then $\|a_n\| = O_p^*(n^{-\tau_a})$ and $\|b_n\| = O_p^*(n^{-\alpha\tau_a/\beta})$.

Once the convergence rate of $a_n$ is established, it becomes reasonable to
fix $a = a_n$ and consider the function $b \mapsto G_n(a_n,b)$. Existing results do not
necessarily yield the convergence rate of the minimizer of this function. The
point of difficulty is that the leading terms in the approximation to this
function near its minimum are more complex than the ones that appear in
the standard asymptotics. The following theorem can handle such cases but
it requires a more refined approximation to the criterion function. One may
want to use the help of Lemma 1 to obtain such an approximation (see, e.g.,
the proof of Theorem 3), and then apply Theorem 2 to derive the "correct"
convergence rate of $b_n$. Note that Theorem 2 places no assumptions at all
on the space containing the $a$-component.

Theorem 2. Let $G_n(a,b)$ be a function of two components, where the
first component belongs to an abstract set, and the second belongs to a
Euclidean space. Suppose that inequalities $G_n(a_n,b_n) \le G_n(a_n,0)$ hold together
with the stochastic bound $\|b_n\| = o_p^*(1)$. Let $\beta$ be positive and let
$\{\alpha_1,\dots,\alpha_p,\beta_1,\dots,\beta_p\}$ be a collection of nonnegative numbers satisfying
$\beta_i < \beta$ for all $i \in \{1,\dots,p\}$. Assume that $G_n$ satisfies a representation

$$G_n(a,b) - G_n(a,0) = M_n(a,b) - N_n(a,b), \tag{2}$$

such that

$$M_n(a_n,b_n) \gtrsim \|b_n\|^{\beta} \quad\text{with inner probability tending to one, and}$$

$$[N_n(a_n,b_n)]_+ = O_p^*\Big(\sum_{i\le p} n^{-\alpha_i}\|b_n\|^{\beta_i}\Big).$$

Then $\|b_n\| = O_p^*(n^{-\tau_b})$ for $\tau_b = \min_{i\le p}\big\{\frac{\alpha_i}{\beta-\beta_i}\big\}$. If $[N_n]_+ = 0$, then
$\mathbb{P}\{b_n = 0\} \to 1$.
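
The balancing rule behind Lemma 1 and Theorem 2 can be sketched as a small calculator. This is an illustration under an assumption: it reads the rate as the minimum over the terms of (exponent of $n$) divided by (leading power minus norm exponent), which is how the two statements are understood here; the function name and argument layout are hypothetical.

```python
from fractions import Fraction


def balance_rate(power, terms):
    """Rate exponent tau obtained by balancing ||b||^power against terms
    of the form n^{-n_exp} * ||b||^{norm_exp}.

    terms: iterable of (n_exp, norm_exp) pairs with norm_exp < power.
    Returns tau = min over terms of n_exp / (power - norm_exp).
    """
    power = Fraction(power)
    rates = []
    for n_exp, norm_exp in terms:
        n_exp, norm_exp = Fraction(n_exp), Fraction(norm_exp)
        if norm_exp >= power:
            raise ValueError("each norm exponent must be smaller than the leading power")
        rates.append(n_exp / (power - norm_exp))
    return min(rates)


# Standard case: quadratic lower bound, random term of order n^{-1/2} * norm.
standard = balance_rate(2, [(Fraction(1, 2), 1)])

# Shorth example of Section 6: terms n^{-1/2}|e_n| and n^{-2/3}.
shorth = balance_rate(2, [(Fraction(1, 2), 1), (Fraction(2, 3), 0)])
```

Exact fractions are used so that rates such as $1/3$ are not blurred by floating-point rounding.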
4. M-estimators. The following definition introduces notation that is
used in the statement of Theorem 3. This notation simplifies the work with
polynomials that are homogeneous functions of the elements of vector $(a,b)$
and the absolute values of the elements of vector $(a,b)$.

Definition 1. Let $\psi$ be a real valued function on $\mathbb{R}^d$ and let $\gamma$ be a
positive constant. Say that $\psi \in H_1^+(\gamma)$ if $\psi(\lambda\theta) = \lambda^{\gamma}\psi(\theta)$ for all $\lambda > 0$ and
$\psi(\theta) > 0$ for all $\theta \ne 0$.

Let $\phi$ be a real valued function on $\mathbb{R}^{d_1}\times\mathbb{R}^{d_2}$ and let $\alpha$ and $\beta$ be some positive
constants. Say that $\phi \in H_2^-(\alpha,\beta)$ if $\phi(\lambda_1 a, \lambda_2 b) = (\lambda_1)^{\alpha}(\lambda_2)^{\beta}\phi(a,b)$ for
all nonnegative $\lambda_1$ and $\lambda_2$, while function $\phi$ assumes at least some negative
values.

Remark. For each continuous function $\psi(\theta)$ in the class $H_1^+(\gamma)$, there
exist positive constants $c_1$ and $c_2$ such that $c_1\|\theta\|^{\gamma} \le \psi(\theta) \le c_2\|\theta\|^{\gamma}$.

Suppose that $X_1, X_2, \dots, X_n$ are independent observations in $\mathbb{R}^k$ coming
from a distribution $P$ and write $P_n$ for the corresponding empirical distribution.
Suppose that $A$ is an open subset of $\mathbb{R}^{d_1}\times\mathbb{R}^{d_2}$ and let $\{g_{a,b}(x) : (a,b) \in A\}$
be a collection of real valued $P$-integrable functions on $\mathbb{R}^k$. Assume
that this collection of functions is centered to satisfy $g_{0,0} = 0$. Suppose that
vectors $(a_n,b_n)$ minimize over $A$ the random criterion functions $G_n(a,b) =
P_n g_{a,b}$ and let $(0,0)$ be the corresponding minimizer of the population analog
$G(a,b) = Pg_{a,b}$. The following theorem derives the asymptotics of $(a_n,b_n)$ in
the challenging case of the singular second derivative matrix $G''(0,0)$.

Theorem 3. Let $\{\alpha,\beta,\gamma_1,\dots,\gamma_p,\eta_1,\dots,\eta_p\}$ be a collection of positive
numbers. Assume that $\alpha \ge \beta > 1$ and $\beta > \eta_j$ for $1 \le j \le p$. Suppose that there
exist continuous functions $\psi_1(a) \in H_1^+(\alpha)$, $\psi_2(b) \in H_1^+(\beta)$ and $\phi_i(a,b) \in
H_2^-(\gamma_i,\eta_i)$ for $1 \le i \le p$, such that near the origin the population criterion
function satisfies the following conditions:

(i) $G(a,b) \gtrsim \|a\|^{\alpha} + \|b\|^{\beta}$;

(ii) $G(a,0) = \psi_1(a) + o(\|a\|^{\alpha})$ and

(iii) $G(a,b) = G(a,0) + \psi_2(b)[1+o(1)] + \sum_{i=1}^{p}\phi_i(a,b)[1+o(1)] +
o\big(\sum_{i=1}^{p}\|a\|^{\gamma_i}\|b\|^{\eta_i}\big)$.

Let $\tau_a = \frac{1}{2(\alpha-1)}$, $\lambda_0 = \frac{1}{2(\beta-1)}$, $\lambda_i = \frac{\gamma_i\tau_a}{\beta-\eta_i}$ for $1 \le i \le p$, and define
$\tau_b = \min_{0\le i\le p}[\lambda_i]$. Suppose there exist on $\mathbb{R}^k$ five square integrable functions, $\Delta_1$
(taking values in $\mathbb{R}^{d_1}$), $\Delta_2$ (taking values in $\mathbb{R}^{d_2}$) and real valued $r_{a,b}$, $s_{a,b}$
and $l_{a,b}$, such that:

(iv) $g_{a,b}(x) = a'\Delta_1(x) + b'\Delta_2(x) + \|(a,b)\|\,r_{a,b}(x)$;

(v) $g_{a,b}(x) - g_{a,0}(x) - b'\Delta_2(x) = l_{a,b}(x) + \|b\|\,s_{a,b}(x)$;

(vi) $\sup_{\|(a,b)\|\le\delta_n}|\nu_n r_{a,b}| = o_p(1)$ and $\sup_{\|(a,b)\|\le\delta_n}|\nu_n s_{a,b}| = o_p(1)$ for all
$\delta_n \to 0$;

(vii) $\sup_{\|a\|\le\delta_n,\,\|b\|\le\epsilon_n}|\nu_n l_{a,b}| = o_p(n^{-\beta\tau_b+1/2})$ for all $\delta_n = O(n^{-\tau_a})$, $\epsilon_n =
O(n^{-\alpha\tau_a/\beta})$.

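Assume that $\|(a_n,b_n)\| = o_p(1)$. If $\alpha\tau_a = \beta\tau_b$, then

$$(n^{\tau_a}a_n,\, n^{\tau_b}b_n) \rightsquigarrow \arg\min_{s,t}\Big[\psi_1(s) + s'Z_1 + \psi_2(t) + 1\{\lambda_0 = \tau_b\}\,t'Z_2 + \sum_{i=1}^{p} 1\{\lambda_i = \tau_b\}\,\phi_i(s,t)\Big];$$

otherwise $(n^{\tau_a}a_n,\, n^{\tau_b}b_n) \rightsquigarrow (s^*,t^*)$, where

$$s^* = \arg\min_{s}\big[\psi_1(s) + s'Z_1\big],$$

$$t^* = \arg\min_{t}\Big[\psi_2(t) + 1\{\lambda_0 = \tau_b\}\,t'Z_2 + \sum_{i=1}^{p} 1\{\lambda_i = \tau_b\}\,\phi_i(s^*,t)\Big].$$

Here $(Z_1,Z_2)$ is a mean zero Gaussian vector with covariance matrix $P(\Delta_1,\Delta_2)\times(\Delta_1,\Delta_2)'$.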

Note that a stochastic process $\nu_n f_{a,b}$ necessarily satisfies the uniform
stochastic bound required in condition (vi) of the above theorem (cf. asymptotic
equicontinuity defined in van der Vaart [20]) if functions $f_{a,b}$ form a
Donsker class and $\|f_{a,b}\|_2 \to 0$ as $\|(a,b)\| \to 0$. Simple ways of checking that
a class of functions is Donsker are given, for example, in van der Vaart's
Theorem 19.5 and Theorem 19.14.

To illustrate the variety of asymptotic results produced by Theorem 3,
consider some simple approximations to the function $G$, which has a singular
second derivative at the origin, where its minimum is located. Let $(a,b) \in \mathbb{R}^2$
and consider the case $G(a,b) \approx a^4 + b^2$. Theorem 3 yields $(n^{1/6}a_n, n^{1/2}b_n) \rightsquigarrow
(\arg\min_s[s^4 + sZ_1],\, \arg\min_t[t^2 + tZ_2])$ if the conditions (iv)-(vii) are satisfied.
Here $(Z_1,Z_2)$ is a mean zero Gaussian vector. Now consider the case
$G(a,b) \approx a^4 + b^2 + a^2b$. Under the same assumptions, the theorem yields
$(n^{1/6}a_n, n^{1/3}b_n) \rightsquigarrow \arg\min_{s,t}[s^4 + sZ_1 + t^2 + s^2t]$. If the approximation is
$G(a,b) \approx a^4 + b^2 + a^3b$, the corresponding result is $(n^{1/6}a_n, n^{1/2}b_n) \rightsquigarrow (s^*,t^*)$
with $s^* = \arg\min_s[s^4 + sZ_1]$ and $t^* = \arg\min_t[t^2 + (s^*)^3t + tZ_2]$. Note that
Theorem 3 does not attempt to cover every conceivable approximation to
$G(a,b)$, as the statement of the result would become too long and complicated,
but each such situation can be handled with only minor modifications
to the proof of the theorem.
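
The rate exponents in these three examples can be checked mechanically. The sketch below assumes the formulas $\tau_a = 1/(2(\alpha-1))$, $\lambda_0 = 1/(2(\beta-1))$ and $\lambda_i = \gamma_i\tau_a/(\beta-\eta_i)$, as read from the statement of Theorem 3; the function name and return layout are hypothetical.

```python
from fractions import Fraction


def theorem3_rates(alpha, beta, cross_terms=()):
    """Rate exponents for a population criterion of the form
    psi_1(a) + psi_2(b) + sum_i phi_i(a, b), where psi_1 has degree alpha,
    psi_2 has degree beta and phi_i has bi-degree (gamma_i, eta_i).

    Returns (tau_a, tau_b, joint, active), where joint tells whether
    alpha * tau_a == beta * tau_b (joint argmin in the limit) and active
    lists the indices i with lambda_i == tau_b (index 0 is the t'Z_2 term).
    """
    alpha, beta = Fraction(alpha), Fraction(beta)
    tau_a = 1 / (2 * (alpha - 1))
    lambdas = [1 / (2 * (beta - 1))]
    for gamma_i, eta_i in cross_terms:
        lambdas.append(Fraction(gamma_i) * tau_a / (beta - Fraction(eta_i)))
    tau_b = min(lambdas)
    joint = alpha * tau_a == beta * tau_b
    active = [i for i, lam in enumerate(lambdas) if lam == tau_b]
    return tau_a, tau_b, joint, active


# a^4 + b^2, then with cross term a^2 b, then with cross term a^3 b.
case_plain = theorem3_rates(4, 2)
case_a2b = theorem3_rates(4, 2, [(2, 1)])
case_a3b = theorem3_rates(4, 2, [(3, 1)])
```

For $a^4 + b^2 + a^2b$ the two sides $\alpha\tau_a$ and $\beta\tau_b$ coincide, which is the joint-argmin case; for the other two approximations the limit is of the two-step form.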
5. Example: Lasso-type estimators. Assume that the observed variables
$Y_i$ satisfy the linear model

$$Y_i = x_i'\beta + \varepsilon_i, \qquad i = 1,\dots,n.$$

The errors $\varepsilon_i$ are independent and identically distributed random variables
that have mean zero and variance $\sigma^2$. The parameter $\beta$ is a vector in $\mathbb{R}^d$
that needs to be estimated. The covariates $x_i$ are fixed and centered, and
the matrix $C_n = \frac{1}{n}\sum_{i=1}^{n} x_i x_i'$ is nonsingular.

Suppose $\lambda_n$ and $\gamma$ are positive real numbers. Define the "Lasso-type"
estimator $\beta_n$ as the minimizer of the penalized least-squares criterion,

$$W_n(\alpha) = \sum_{i=1}^{n}(Y_i - x_i'\alpha)^2 + \lambda_n\sum_{j=1}^{d}|\alpha_j|^{\gamma},$$

over all vectors $\alpha = (\alpha_1,\dots,\alpha_d)'$. In the particular cases of $\gamma = 1$ and $\gamma = 2$,
this estimator corresponds, respectively, to the "Lasso" of Tibshirani [18]
and the ridge regression. For general $\gamma$, such estimators were introduced
by Frank and Friedman [4]. The limiting behavior of the estimator $\beta_n$ was
described by Knight and Fu [7] under certain conditions on the growth rate
of the weighting sequence $\{\lambda_n\}$.
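
As a minimal sketch of the penalized criterion (assuming the bridge form of Knight and Fu, residual sum of squares plus $\lambda_n\sum_j|\alpha_j|^{\gamma}$), the function below evaluates $W_n$ for a candidate vector. The function name and the tuple-based data layout are hypothetical.

```python
def lasso_type_criterion(alpha, xs, ys, lam, gamma):
    """Penalized least-squares criterion W_n(alpha).

    alpha: candidate coefficient vector (sequence of floats)
    xs:    covariate vectors, one sequence per observation
    ys:    responses
    lam:   penalty weight lambda_n (positive)
    gamma: penalty exponent (gamma = 1 is the Lasso, gamma = 2 is ridge)
    """
    rss = 0.0
    for x, y in zip(xs, ys):
        fitted = sum(xj * aj for xj, aj in zip(x, alpha))
        rss += (y - fitted) ** 2
    penalty = lam * sum(abs(aj) ** gamma for aj in alpha)
    return rss + penalty


# Hypothetical two-observation example with gamma = 1/2.
xs_demo = [(1.0, 0.0), (0.0, 1.0)]
ys_demo = [1.0, 0.0]
value_at_truth = lasso_type_criterion((1.0, 0.0), xs_demo, ys_demo, lam=2.0, gamma=0.5)
```

For $\gamma < 1$ the penalty is not convex, so a generic convex solver is not guaranteed to find the global minimizer; the function above only evaluates the criterion.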

Assume that the design satisfies the following regularity conditions: 

(i) matrices $C_n$ converge to a fixed matrix $C$;

(ii) as $n$ tends to infinity, $n^{-1}\max_{i\le n}(x_i'x_i)$ converges to zero.

In the case of the nonsingular matrix $C$, Knight and Fu derived the
$\sqrt{n}$-asymptotics for $\beta_n$ after setting the growth rate for the weighting sequence
$\{\lambda_n\}$. They required that, for some nonnegative constant $\lambda_0$,

$$\lambda_n / n^{\min(1/2,\,\gamma/2)} \to \lambda_0. \tag{3}$$

Note that when $\lambda_0 = 0$, the penalty contribution is asymptotically negligible
and the limiting behavior of the estimator $\beta_n$ is the same as that of the usual
least-squares estimator.

To derive the asymptotics of $\beta_n$, Knight and Fu used a standard approach
that is based on rescaling the parameters at the same rate and applying a
continuous mapping type of argument. When vector $\beta$ has a zero component,
$\gamma < 1$, and $\lambda_n$ grows faster than the rate given in (3), this approach fails to
deliver the complete asymptotics. For concreteness, consider the case $d = 2$,
$\beta = (1,0)'$, $\gamma = 1/2$, and set $\lambda_n = \lambda_0 n^{1/2}$ for some positive constant $\lambda_0$. The
standard approach establishes the asymptotics of the first component of $\beta_n$,
but only yields the $o_p(n^{-1/2})$ stochastic order for the second component of
the estimator (see Knight and Fu [7], page 1361). The techniques developed
in Section 3 are applied below to show that the second component is in fact
exactly zero with probability tending to one.
Because $C$ is nonsingular and $\lambda_n = o(n)$, the estimator $\beta_n$ is consistent
(see Theorem 1 of Knight and Fu [7]). The proof is based on the fact that,
for each fixed $\alpha$, the penalty part of the criterion function $W_n(\alpha)$ is asymptotically
negligible compared to the least-squares part. Focus on vectors $\alpha$
that are near the true parameter $\beta$, and write $\alpha = \beta + (a,b)'$. Express the
penalized criterion function in terms of $a$ and $b$. Denote $n^{-1}[W_n(\alpha) - W_n(\beta)]$
by $G_n(a,b)$, and let $Z_n$ stand for $n^{-1/2}\sum_{i=1}^{n}\varepsilon_i x_i$. The regularity conditions
on the design guarantee that the sequence of random vectors $Z_n$ has a limiting
Gaussian distribution with mean zero and covariance $\sigma^2 C$. As $a$ and $b$
tend to zero,

$$G_n(a,b) = (a,b)C_n(a,b)' - 2n^{-1/2}(a,b)Z_n + \frac{\lambda_0}{2}n^{-1/2}a[1+o(1)] + \lambda_0 n^{-1/2}|b|^{1/2}.$$

The $o(1)$ terms come from the Taylor expansion of $|1+a|^{1/2}$ near $a = 0$.
Function $G_n$ is minimized by the vector $(a_n,b_n)'$ that is defined as the difference
between $\beta_n$ and $\beta$.

Define $M_n(a,b)$ to be $(a,b)C_n(a,b)' + \lambda_0 n^{-1/2}|b|^{1/2}$ and let $N_n$ equal
$M_n - G_n$, so that the representation required by Lemma 1 holds. Note that for
all $n$ large enough, the eigenvalues of the sequence of matrices $C_n$ are bounded
away from zero. Apply Lemma 1 from Section 3 and conclude that
$\|(a_n,b_n)\| = O_p(n^{-1/2})$.

Let $v_n$ denote the bottom right element of the matrix $C_n$. Observe that

$$G_n(a_n,b_n) - G_n(a_n,0) = v_n b_n^2 + O_p(n^{-1/2}|b_n|) + \lambda_0 n^{-1/2}|b_n|^{1/2}.$$

Note that $\lambda_0$ and $v_n$ are positive and $v_n$ is bounded away from zero for all
sufficiently large $n$. Deduce that, with probability tending to one, the right-hand
side of the above display is bounded below by $c\,b_n^2$ for some positive $c$.
Apply Theorem 2 with $N_n = 0$ and conclude that $\mathbb{P}\{b_n = 0\} \to 1$.
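
The lower bound in the last step can be spelled out as follows. This is a sketch of the comparison between the penalty term and the $O_p(n^{-1/2}|b_n|)$ term, using only the rate $|b_n| = O_p(n^{-1/2})$ obtained from Lemma 1; the constant $K$ is introduced here for illustration.

```latex
% On the event {|b_n| <= K n^{-1/2}}, whose probability can be made
% arbitrarily close to one by taking K large, and for b_n \neq 0:
\[
  \frac{\lambda_0\, n^{-1/2} |b_n|^{1/2}}{n^{-1/2} |b_n|}
  \;=\; \lambda_0\, |b_n|^{-1/2}
  \;\ge\; \lambda_0\, K^{-1/2}\, n^{1/4} \;\longrightarrow\; \infty .
\]
% Hence the penalty term eventually dominates the O_p(n^{-1/2}|b_n|) term,
% their sum is nonnegative with probability tending to one, and
\[
  G_n(a_n,b_n) - G_n(a_n,0) \;\ge\; v_n\, b_n^2 \;\ge\; c\, b_n^2 .
\]
```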

More examples of mixed-rates behavior in Lasso-type estimation can be 
found in Radchenko [14]. 

6. Example: Shorth. Assume that the observations are independently
sampled from a distribution $P$ on the real line and let $[m_n - r_n, m_n + r_n]$ be
the shortest interval that contains at least half of the first $n$ observations.
The shorth estimator is defined as the average over such an interval, but the
goal of this section is the limiting behavior of $m_n$ and $r_n$. Grübel [5] derived
the root-$n$ asymptotics for $r_n$ and Kim and Pollard [6] derived the cube root
asymptotics for $m_n$. The methods of the present paper allow one to establish
the joint limiting behavior of $(m_n,r_n)$ using a simple approximation to the
criterion function.
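
For concreteness, the shortest half-interval can be computed by sliding a window over the sorted sample. The sketch below is an illustration rather than code from the paper; the function name is hypothetical, and the shorth value it returns assumes there are no ties at the endpoints of the selected window.

```python
import math


def shortest_half(xs):
    """Return (m_n, r_n, shorth) for a nonempty sample.

    [m_n - r_n, m_n + r_n] is the shortest closed interval containing at
    least half of the observations; shorth is the average of the
    observations in the selected window.
    """
    data = sorted(xs)
    n = len(data)
    if n == 0:
        raise ValueError("sample must be nonempty")
    h = math.ceil(n / 2)  # smallest number of points that is at least half
    # The shortest such interval has order statistics as endpoints, so it is
    # enough to scan windows of h consecutive order statistics.
    start = min(range(n - h + 1), key=lambda j: data[j + h - 1] - data[j])
    window = data[start:start + h]
    m = (window[0] + window[-1]) / 2.0
    r = (window[-1] - window[0]) / 2.0
    return m, r, sum(window) / h
```

For example, `shortest_half([0, 10, 11, 12, 30, 40])` selects the window `[10, 11, 12]`.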

Denote by $\mu$ and $\rho$ the population solution, in other words, let $[\mu - \rho, \mu + \rho]$
be the shortest interval to which $P$ assigns at least half the probability.
Assume that the population solution is unique and let $P$ have a bounded
density $f$ that is differentiable at the endpoints $\mu \pm \rho$. Define the criterion
function $V_n$ by $V_n(\epsilon,\delta) = P_n[(\mu+\epsilon) - (\rho+\delta), (\mu+\epsilon) + (\rho+\delta)] - 1/2$, and
let $V(\epsilon,\delta)$ denote the population analog obtained by replacing $P_n$ with $P$.
Observe that $V(0,0) = 0$ and write out the Taylor expansion for function $V$
near the origin:

$$V(\epsilon,\delta) = c_1\delta + c_2\epsilon^2 + c_3\epsilon\delta + c_4\delta^2 + o(\epsilon^2 + \delta^2), \tag{4}$$

where the coefficients are $c_1 = f(\mu-\rho) + f(\mu+\rho)$, $c_2 = c_4 = [f'(\mu+\rho) -
f'(\mu-\rho)]/2$ and $c_3 = f'(\mu+\rho) + f'(\mu-\rho)$. The coefficient of the linear term
in $\epsilon$ equals zero because the function $V(\epsilon,0)$ is maximized at $\epsilon = 0$. This
forces the equality $f(\mu+\rho) = f(\mu-\rho)$. By the same reasoning, coefficient
$c_2$ must be nonpositive. Assume $c_2 < 0$ and $c_1 > 0$ for regularity.

Recall the bound $\sup_{m,r}|P_n[m-r,m+r] - P[m-r,m+r]| = O_p(n^{-1/2})$
from the standard empirical process theory. Denote this supremum by $\Delta_n$.
Uniqueness of the population solution and regularity assumptions on the
coefficients of the Taylor expansion (4) guarantee that there exists a positive
constant $c$ such that, for all small enough positive $\delta$, inequality
$\sup_m P[m - (\rho-\delta), m + (\rho-\delta)] < 1/2 - c\delta$ holds. Consequently,

$$\sup_m P_n[m - (\rho - \Delta_n/c),\, m + (\rho - \Delta_n/c)] < \Delta_n + 1/2 - c\Delta_n/c = 1/2,$$

and hence, $\delta_n > -\Delta_n/c$. Expansion (4) also implies existence of a positive
constant $b$ such that $P[\mu - (\rho+\delta), \mu + (\rho+\delta)] > 1/2 + b\delta$ for all small
enough positive $\delta$. Take $\delta = \Delta_n/b$ and deduce that $P_n[\mu - (\rho+\Delta_n/b), \mu +
(\rho+\Delta_n/b)] > 1/2$. Conclude that $\delta_n < \Delta_n/b$ and, hence, $\delta_n = O_p(n^{-1/2})$.

Note that function $V_n(\cdot,\delta_n)$ is maximized by $\epsilon_n$ and function $V(\cdot,0)$ has
a clean maximum at zero. Uniform convergence in probability of $V_n(\cdot,\delta_n)$ to
$V(\cdot,0)$ implies $\epsilon_n = o_p(1)$. Introduce functions $M_n(\epsilon,\delta) = |c_2|\epsilon^2/4$ and define
functions $N_n$ by equalities

$$-[V_n(\epsilon,\delta) - V_n(0,\delta)] = M_n(\epsilon,\delta) - N_n(\epsilon,\delta).$$

Note that when $\delta = \delta_n$, the expression on the left-hand side is minimized
by $\epsilon = \epsilon_n$. Denote the difference between the indicator functions of intervals
$[(\mu+\epsilon) - (\rho+\delta), (\mu+\epsilon) + (\rho+\delta)]$ and $[\mu - (\rho+\delta), \mu + (\rho+\delta)]$ by $J(\epsilon,\delta)$.
Observe that

$$V_n(\epsilon,\delta) - V_n(0,\delta) = V(\epsilon,\delta) - V(0,\delta) + (P_n - P)J(\epsilon,\delta).$$

Recall that $c_2 < 0$ by the regularity assumptions placed on the coefficients
of expansion (4), and use the Taylor expansion (4) to deduce a stochastic
bound

$$N_n(\epsilon_n,\delta_n) \le (P_n - P)J(\epsilon_n,\delta_n) - |c_2|\epsilon_n^2/2 + O_p(n^{-1/2}|\epsilon_n|) + O_p(n^{-1}). \tag{5}$$
Note that the collection of functions $J(\epsilon,\delta)$ is a Vapnik-Červonenkis class.
For $R$ near zero, the envelope function $G_R = \sup_{\epsilon^2+\delta^2\le R^2}|J(\epsilon,\delta)|$ is the
indicator of the two intervals of total length bounded above by $4R$; boundedness
of the density implies $PG_R^2 = O(R)$. Hence, the conditions of Lemma
4.1 of Kim and Pollard [6] are satisfied and, consequently, the bound
$|(P_n - P)J(\epsilon_n,\delta_n)| - c\epsilon_n^2 \le O_p(n^{-2/3})$ is valid for each positive $c$. It follows that

$$[(P_n - P)J(\epsilon_n,\delta_n) - |c_2|\epsilon_n^2/2]_+ = O_p(n^{-2/3}).$$

Combine this stochastic bound with bound (5) and deduce that

$$[N_n(\epsilon_n,\delta_n)]_+ = O_p(n^{-1/2}|\epsilon_n|) + O_p(n^{-2/3}).$$

An application of Theorem 2 yields $\epsilon_n = O_p(n^{-1/3})$.

Set $I_{s,t} = 1[(\mu + n^{-1/3}t) - (\rho + n^{-1/2}s),\, (\mu + n^{-1/3}t) + (\rho + n^{-1/2}s)] -
1[\mu - \rho, \mu + \rho]$ and define the localized criterion functions $H_n(s,t) = V_n(n^{-1/3}t,
n^{-1/2}s)$. Use the empirical process notation to write an approximation to
$H_n(s,t)$ that holds uniformly on compacta:

$$H_n(s,t) = n^{-1/2}\{c_1 s + \nu_n[\mu-\rho,\mu+\rho]\} + n^{-2/3}\{c_2 t^2 + n^{1/6}\nu_n I_{s,t} + o(1)\}. \tag{6}$$

On each compact set, the stochastic processes $X_n(s,t) = n^{1/6}\nu_n I_{n^{1/6}s,t}$
converge in distribution to a tight Gaussian process by Theorem 19.28 of van
der Vaart [20]. The conditions of the theorem are checked in van der Vaart's
Example 19.29 for essentially the same process as $X_n$. Consequently, $X_n$
satisfies the asymptotic equicontinuity condition of van der Vaart's Theorem
18.14, and approximation $n^{1/6}\nu_n[I_{s,t} - I_{0,t}] = o_p(1)$ holds uniformly over
$s$ and $t$ in each given compact set. Note that

$$\lim n^{1/3}PI_{0,t}I_{0,t'} = c_1\min(|t|,|t'|)$$

and

$$\lim n^{1/6}PI_{0,t}\,1[\mu-\rho,\mu+\rho] = 0.$$

Write $f_n(s)$ and $g_n(s,t)$ for the two expressions in curly brackets that appear
in representation (6). Let $B(t)$ be a two-sided Brownian motion and let $Z$
be an independent $N(0,1/2)$ random variable. Conclude that

$$(f_n(s), g_n(s,t)) \rightsquigarrow (c_1 s - Z,\; c_2 t^2 + \sqrt{c_1}B(t)).$$

Recall that the rescaled solution $(s_n,t_n) = (n^{1/2}\delta_n, n^{1/3}\epsilon_n)$ is stochastically
bounded and note the relationship

$$s_n = \inf\{s : H_n(s,t_n) \ge 0\} \quad\text{and}\quad t_n = \arg\max_t[H_n(s_n,\cdot)].$$
Consider the functional $\Psi : h \mapsto \inf\{s : h(s) \ge 0\}$ and note that the value
$\Psi[H_n(\cdot,t)]$ is well defined and finite for each $t$. Also note that $\Psi$ is invariant
to multiplications by positive constants. Apply Theorem 1 in Radchenko
[15] and express the result in terms of the original variables:

$$\big(n^{1/2}(r_n - \rho),\; n^{1/3}(m_n - \mu)\big) \rightsquigarrow \Big(Z/c_1,\; \arg\max_t\big[c_2 t^2 + \sqrt{c_1}B(t)\big]\Big).$$

Standard techniques fail to extract the limiting behavior of the estimator
directly from approximation (6) because the first component of the approximation
dominates the essential second component as $n$ tends to infinity.

7. Example: k-means. The k-means procedure divides observations $X_1,
\dots, X_n$ in $\mathbb{R}^d$ into $k$ sets by locating the cluster centers and then assigning
each observation to the closest center. The set of cluster centers $C_n =
\{c_{1n},\dots,c_{kn}\}$ is chosen to minimize

$$W_n(C) = \frac{1}{n}\sum_{i\le n}\min_{1\le j\le k}\|X_i - c_j\|^2 \tag{7}$$

as a function of sets $C = \{c_1,\dots,c_k\}$ of $k$ not necessarily distinct points
in $\mathbb{R}^d$. Assume that the observations are independent and come from a
distribution $P$ on $\mathbb{R}^d$. Define the population criterion function, $W(C) =
P\min_{1\le j\le k}\|X_1 - c_j\|^2$, and let $C_0$ be a set that minimizes $W$. Note that
if $P$ has a finite second moment and is not concentrated on fewer than $k$
points, then each set of optimal population centers has to contain exactly $k$
points. Under these conditions, and given that the set $C_0$ of optimal population
centers is uniquely defined, Pollard [9] showed that the sets $C_n$ of
optimal empirical centers are strongly consistent with respect to the Hausdorff
metric.
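
The empirical k-means criterion can be evaluated directly for a candidate set of centers. The sketch below assumes the criterion is the average over the sample of the squared distance to the nearest center (the $1/n$ normalization is an assumption and does not affect the minimizer); the function name and tuple-based layout are hypothetical.

```python
def kmeans_criterion(points, centers):
    """Average squared Euclidean distance from each point to its nearest center.

    points:  sequence of coordinate tuples (the observations)
    centers: sequence of coordinate tuples (not necessarily distinct)
    """
    if not points or not centers:
        raise ValueError("points and centers must be nonempty")
    total = 0.0
    for x in points:
        # Squared distance to the closest center.
        total += min(
            sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            for c in centers
        )
    return total / len(points)


# Hypothetical example: three points on a line, two centers.
demo_value = kmeans_criterion([(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)],
                              [(1.0, 0.0), (10.0, 0.0)])
```

Minimizing this function over sets of $k$ centers gives the empirical centers $C_n$; the function only evaluates the criterion and does not perform the minimization.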

In the example that follows, condition (vi) of Theorem 3 needs to be 
verified for classes of functions that possess the following simple property. 

Property 1. The class of functions $f_\theta(\cdot)$ satisfies the following
conditions:

(i) the envelope function $F(\cdot)$ is square integrable with respect to $P$;

(ii) there exist positive integers $N$ and $d$ such that each $f_\theta$ can be
represented as a sum of at most $N$ functions of the form $LQ$, where $L$ is a linear
function and $Q$ is the indicator function of the intersection of at most $N$
half-spaces in $\mathbb{R}^d$;







(iii) $\|f_\theta\|_2 \to 0$ as $\theta \to 0$.



The first two conditions imply that the class of functions $f_\theta$ is Donsker.
This fact is proved on page 921 of Pollard [10], but it can also be easily
deduced from the standard results on pages 274-276 of van der Vaart [20].
The third condition together with the Donsker property yield the required
$\sup_{\|\theta\| \le \delta_n} |\nu_n f_\theta| \to 0$ for each $\delta_n \to 0$.

The following is a two-dimensional extension of the example discussed in
Section 7.1 of Radchenko [15]. Consider a distribution $Q$ on the plane $(x,y)$
that concentrates on the lines $\{y = 1\}$ and $\{y = -1\}$. Let $Q$ put probability
one half on each line, and let the conditional distribution on each line be the
double exponential. Write $Q$ as $P \times \mu$, where $P$ is the double exponential
distribution and $\mu\{-1\} = \mu\{1\} = 1/2$.

There are two pairs of optimal population centers, $\{(-1, 0), (1, 0)\}$ and
$\{(0,-1), (0,1)\}$; denote them by $C^v = \{c_1^v, c_2^v\}$ and $C^h = \{c_1^h, c_2^h\}$,
respectively. The superscripts reflect either the vertical or the horizontal direction
of the split line, which is defined as the common boundary for the two
Voronoi half-planes generated by a given pair of centers. Let $C_n^v$ and $C_n^h$
minimize the criterion function (7) over two fixed nonoverlapping Hausdorff
neighborhoods of the sets $C^v$ and $C^h$, respectively, and let $C_n$ be a global
minimizer. A slight extension of Pollard's consistency result yields

$$C_n \in \{C_n^v, C_n^h\} \quad \text{with probability tending to one,}$$

$$C_n^h \to C^h \quad\text{and}\quad C_n^v \to C^v \quad \text{almost surely,}$$

where the set convergence is understood with respect to the Hausdorff
metric. In fact, the probability with which $C_n$ chooses between the two
configurations converges to a half. Near the set $C^h$, the population criterion function
$W$ is approximated by a nonsingular quadratic. As a result, the solution $C_n^h$
settles down at the standard $n^{-1/2}$ rate and satisfies a central limit theorem.
The remainder of the section is concerned with deriving the asymptotics of
$C_n^v$, which is a challenging problem because the quadratic approximation to
the population criterion function near the set $C^v$ is singular.
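The claim that both configurations are optimal can be checked numerically: each gives a population criterion value of 2, because a standard double exponential variable $X$ satisfies $E X^2 = 2$ and $E(|X| - 1)^2 = 1$. The sketch below is a Monte Carlo check only. It assumes that the double exponential coordinate is $x$ and that the two lines sit at $y = \pm 1$, the orientation used by the functions $A(z)$, $D$ and $U$ later in this section; the helper names, sample size and seed are illustrative choices, not part of the paper.

```python
import numpy as np

def kmeans_criterion(points, centers):
    """Mean squared distance from each point to its nearest center."""
    diffs = points[:, None, :] - centers[None, :, :]
    return float(np.mean(np.min(np.sum(diffs ** 2, axis=2), axis=1)))

def sample_two_lines(n, seed=0):
    """Draw n points: x is standard double exponential, y is +1 or -1 with equal odds."""
    rng = np.random.default_rng(seed)
    x = rng.laplace(loc=0.0, scale=1.0, size=n)
    y = rng.choice([-1.0, 1.0], size=n)
    return np.column_stack([x, y])

# The two candidate configurations named in the text.
C_V = np.array([[-1.0, 0.0], [1.0, 0.0]])   # vertical split line x = 0
C_H = np.array([[0.0, -1.0], [0.0, 1.0]])   # horizontal split line y = 0

if __name__ == "__main__":
    pts = sample_two_lines(200_000)
    print(kmeans_criterion(pts, C_V), kmeans_criterion(pts, C_H))
```

Both printed values should be close to 2, up to Monte Carlo error. The check says nothing about which configuration the empirical minimizer selects.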

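Suppose that $C = \{c_1, c_2\}$ is a candidate to minimize the criterion function
(7) over a small Hausdorff neighborhood of the set $C^v$. Let $c_1 = (c_{1x}, c_{1y})$
be the point lying close to $c_1^v = (-1, 0)$ and let $c_2 = (c_{2x}, c_{2y})$ lie close to
$c_2^v = (1, 0)$. Write $z$ to denote a point on the plane and let $(x,y)$ be the
coordinate form of $z$. Introduce new variables by

$$\delta_s = \tfrac{1}{2}(c_{1x} + c_{2x}), \qquad \delta_d = 1 + \tfrac{1}{2}(c_{1x} - c_{2x}),$$
$$\varepsilon_s = \tfrac{1}{2}(c_{1y} + c_{2y}) \quad\text{and}\quad \varepsilon_d = \tfrac{1}{2}(c_{1y} - c_{2y}),$$

so that $c_1 = (-1 + \delta_s + \delta_d,\ \varepsilon_s + \varepsilon_d)$ and $c_2 = (1 + \delta_s - \delta_d,\ \varepsilon_s - \varepsilon_d)$.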

These variables contain the information on how far the centers in the set $C$
lie from the corresponding centers in $C^v$. Define $a = (\delta_s, \varepsilon_d)$ and $b = (\delta_d, \varepsilon_s)$.
Let $g_{a,b}(z)$ be the squared distance from $z$ to the closest center in $C$, written
in terms of $(a, b)$ and centered:

$$g_{a,b}(z) = [(x + 1 - \delta_s - \delta_d)^2 + (y - \varepsilon_s - \varepsilon_d)^2]
  \wedge [(x - 1 - \delta_s + \delta_d)^2 + (y - \varepsilon_s + \varepsilon_d)^2]
  - \|z + (1,0)\|^2 \wedge \|z - (1,0)\|^2.$$

Define criterion functions $G_n(a,b) = P_n g_{a,b}$ and $G(a,b) = P g_{a,b}$, and note
that they are just the functions $W_n(C)$ and $W(C)$ centered at the set $C^v$
and written in terms of the new variables. Note that $G(a, b)$ is minimized
at zero and the points $(a_n, b_n)$ that minimize $G_n(a,b)$ are of order $o_p(1)$
because of consistency.

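The following approximation holds for $G$ near zero (see Section 2.3 of
Radchenko [13]):

$$G(a,b) = \tfrac{1}{6}(|\delta_s| + |\varepsilon_d|)^3 + \tfrac{1}{6}\big||\delta_s| - |\varepsilon_d|\big|^3
  + \delta_d^2 + \varepsilon_s^2 + \delta_s^2\delta_d + 2\delta_s\varepsilon_d\varepsilon_s - \varepsilon_d^2\delta_d + O(\|(a,b)\|^4).$$

Note that conditions (i), (ii) and (iii) of Theorem 3 are satisfied with $\alpha = 3$,
$\beta = 2$, $p = 3$ and $(\gamma_i, \eta_i) = (2, 1)$ for $1 \le i \le 3$. The corresponding
homogeneous functions are

$$\psi_1(a) = \tfrac{1}{6}(|\delta_s| + |\varepsilon_d|)^3 + \tfrac{1}{6}\big||\delta_s| - |\varepsilon_d|\big|^3, \qquad \psi_2(b) = \delta_d^2 + \varepsilon_s^2$$

and

$$\phi_1(a,b) = \delta_s^2\delta_d, \qquad \phi_2(a,b) = 2\delta_s\varepsilon_d\varepsilon_s, \qquad \phi_3(a,b) = -\varepsilon_d^2\delta_d.$$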

Take functions $\Delta_1(z)$ and $\Delta_2(z)$ in condition (iv) as the $L_1$ partial
derivatives of $g_{a,b}(z)$ with respect to $a$ and $b$ at $(0,0)$. For example, let

$$b'\Delta_2(z) = -[2\delta_d(x + 1) + 2\varepsilon_s y]H_-(z) + [2\delta_d(x - 1) - 2\varepsilon_s y]H_+(z),$$

where $H_-(z)$ is the indicator function of the half-plane $\{z : x < 0\}$ and $H_+(z)$
is the indicator function of the half-plane $\{z : x > 0\}$. Condition (vi) for
the remainder functions $r_{a,b}(z)$ follows from a general result on k-means
(see Pollard [10], Lemma B). The proof essentially consists of verifying that
Property 1 holds for the class $\{r_{a,b}\}$.

The expression for $g_{a,b}(z)$ depends on the sign of $x$ and on which of the
centers in the set $C$ lies closer to the point $z$. Let $D$ and $U$ be the
$x$-coordinates of the crossing points of the split line corresponding to $C$ with
the lines $\{y = -1\}$ and $\{y = 1\}$, respectively. Note that when $b = 0$, the
values $D$ and $U$ are simply $\delta_s - \varepsilon_d$ and $\delta_s + \varepsilon_d$. Introduce functions

$$A(z) = 1\{|x| < |D|,\ xD > 0,\ y = -1\} + 1\{|x| < |U|,\ xU > 0,\ y = 1\}$$

and

$$A^\circ(z) = A(z) - 1\{|x| < |\delta_s - \varepsilon_d|,\ x(\delta_s - \varepsilon_d) > 0,\ y = -1\}
  - 1\{|x| < |\delta_s + \varepsilon_d|,\ x(\delta_s + \varepsilon_d) > 0,\ y = 1\}.$$

Simplify the notation for products of indicator functions by writing, for
example, $AH_+(z)$ for $A(z)H_+(z)$, and derive that

$$g_{a,b}(z) - g_{a,0}(z) - b'\Delta_2(z)
  = \delta_d^2 + \varepsilon_s^2 + 2(\delta_s\delta_d + \varepsilon_s\varepsilon_d)[H_-(z) - H_+(z)]$$
$$\qquad + 4(x\delta_d - \delta_s\delta_d - \varepsilon_s\varepsilon_d)[AH_-(z) - AH_+(z)]
  + 4(\delta_s - x + y\varepsilon_d)[A^\circ H_-(z) - A^\circ H_+(z)]. \qquad (8)$$

Define the remainder functions $s_{a,b}(z)$ by equalities

$$\|b\|\, s_{a,b}(z) = \delta_d^2 + \varepsilon_s^2 + 2(\delta_s\delta_d + \varepsilon_s\varepsilon_d)[H_-(z) - H_+(z)]
  + 4(x\delta_d - \delta_s\delta_d - \varepsilon_s\varepsilon_d)[AH_-(z) - AH_+(z)],$$

and observe that Property 1 holds for the class $\{s_{a,b}\}$. Thus, conditions (v)
and (vi) of Theorem 3 are satisfied if the functions $l_{a,b}(z)$ are defined as
the remaining part of expression (8), namely,
$4(\delta_s - x + y\varepsilon_d)[A^\circ H_-(z) - A^\circ H_+(z)]$. Define $\tau_a = 1/4$ and $\tau_b = 1/2$
as Theorem 3 prescribes. It is only left to check condition (vii) of Theorem 3
by establishing

$$\sup_{(a,b) \in \mathcal{N}_n} |\nu_n l_{a,b}| = o_p(n^{-1/2}) \qquad (9)$$

for all sequences of rectangles $\mathcal{N}_n$ of the order $O(n^{-1/4}) \times O(n^{-3/8})$ that
are centered at the origin. Write out the Taylor approximations
$U = \delta_s + \varepsilon_d + \delta_d\varepsilon_d - \varepsilon_s\varepsilon_d + o(\|(a,b)\|^2)$ and
$D = \delta_s - \varepsilon_d - \delta_d\varepsilon_d - \varepsilon_s\varepsilon_d + o(\|(a,b)\|^2)$,
and conclude that quantities $|D - (\delta_s - \varepsilon_d)|$ and $|U - (\delta_s + \varepsilon_d)|$ are of order
$O(n^{-5/8})$ uniformly over $(a, b)$ in the neighborhoods $\mathcal{N}_n$. Use the oscillation
properties of the empirical process established on page 765 in Shorack and
Wellner [17] to conclude that

$$\sup_{(a,b) \in \mathcal{N}_n} |\nu_n A^\circ H_-(z) - \nu_n A^\circ H_+(z)| = o_p(n^{-1/4}).$$

Stochastic bound (9) follows directly, because the factor $4(\delta_s - x + y\varepsilon_d)$ is of
order $O(n^{-1/4})$ on the support of $A^\circ$.
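The Taylor approximations for $U$ and $D$ can be checked against the exact crossing points of the split line. The sketch below places the centers at $c_1 = (-1 + \delta_s + \delta_d,\ \varepsilon_s + \varepsilon_d)$ and $c_2 = (1 + \delta_s - \delta_d,\ \varepsilon_s - \varepsilon_d)$, as read off from the two squared distances in the definition of $g_{a,b}(z)$, and solves the equidistance equation for $x$ at a given height $y$. The function names are hypothetical and serve only this illustration.

```python
def centers_from_params(delta_s, delta_d, eps_s, eps_d):
    """Centers c1, c2 implied by the two squared distances in g_{a,b}(z)."""
    c1 = (-1.0 + delta_s + delta_d, eps_s + eps_d)
    c2 = (1.0 + delta_s - delta_d, eps_s - eps_d)
    return c1, c2

def split_crossing(c1, c2, y):
    """x-coordinate where the set of points equidistant from c1 and c2 meets height y.

    Solves (x - x1)^2 + (y - y1)^2 = (x - x2)^2 + (y - y2)^2 for x; needs x1 != x2.
    """
    (x1, y1), (x2, y2) = c1, c2
    numerator = x2 ** 2 - x1 ** 2 + y2 ** 2 - y1 ** 2 - 2.0 * y * (y2 - y1)
    return numerator / (2.0 * (x2 - x1))

def taylor_crossings(delta_s, delta_d, eps_s, eps_d):
    """Second-order approximations (U, D) quoted in the text."""
    u = delta_s + eps_d + delta_d * eps_d - eps_s * eps_d
    d = delta_s - eps_d - delta_d * eps_d - eps_s * eps_d
    return u, d

if __name__ == "__main__":
    params = (0.01, 0.02, 0.03, 0.04)
    c1, c2 = centers_from_params(*params)
    print(split_crossing(c1, c2, 1.0), split_crossing(c1, c2, -1.0))
    print(taylor_crossings(*params))
```

For small parameter values the exact and approximate crossing points differ only at third order, and for $b = 0$ they coincide with $\delta_s \pm \varepsilon_d$.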

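Apply Theorem 3 and deduce that $(n^{1/4}a_n, n^{1/2}b_n) \rightsquigarrow (s^*, t^*)$, where

$$s^* = \arg\min_s\,[\psi_1(s) + s'Z_1]$$

and

$$t^* = \arg\min_t\,\Big[\psi_2(t) + t'Z_2 + \sum_{i=1}^{3} \phi_i(s^*, t)\Big].$$

A closed form expression for $(s^*, t^*)$ is given in Section 2.3 of Radchenko
[13].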
8. Example: partial splines. The following semiparametric example is
discussed in Van de Geer ([19], Chapter 11), where a CLT is established 
for the parametric component using its characterization as a zero of the 
derivative of the criterion function. Below, the same result is derived by 
working directly with the definition of the estimator as a minimizer, using 
the approach introduced in Sections 2 and 3 for mixed-rates parametric 
problems. 

Let $(Y_1, Z_1), \ldots, (Y_n, Z_n), \ldots$ be independent copies of $(Y, Z)$, where $Y$ is
a real-valued response variable and $Z$ is a covariate. Suppose, for simplicity,
that $Z$ takes values in $[0, 1]^2$, write $Z = (U, V)$ and assume that the model

$$Y = g(U,V) + W$$

is satisfied with $E(W \mid Z) = 0$ and $g(U,V) = \theta U + \gamma(V)$. Here $\theta \in \mathbb{R}$ is an
unknown parameter, and $\gamma$ is an unknown member of the functional class

$$\mathcal{S} = \Big\{\eta : [0,1] \to \mathbb{R},\ \int_0^1 |\eta^{(m)}(v)|^2\,dv < \infty\Big\}$$

defined for a fixed positive integer $m$. Assume that the tails of the error
distribution decrease exponentially fast: there exist positive constants $K$
and $\sigma_0$, such that, for all $z \in [0, 1]^2$,

$$2K^2\,E\big(e^{|W|/K} - 1 - |W|/K \mid Z = z\big) \le \sigma_0^2.$$

Denote the distribution of $(U,V)$ by $Q$ and write $\|f\|_2$ for the $L_2(Q)$-norm
of a function $f$. Define functions $e(v) = E(U \mid V = v)$ and $h(u,v) = u - e(v)$.
Assume that $\|h\|_2 > 0$.

Fix a positive $\lambda_0$ and take $\lambda_n = \lambda_0 n^{-m/(2m+1)}$. Consider a class $\mathcal{F}$ of all
regression functions $f$ of the form $f(u, v) = \alpha u + \eta(v)$ with $\alpha \in \mathbb{R}$ and $\eta \in \mathcal{S}$.
Denote the roughness of such a function by $I^2(f) = I^2(\eta) = \int_0^1 |\eta^{(m)}(v)|^2\,dv$.
Define

$$\hat{g}_n = \arg\min_{f \in \mathcal{F}} \Big\{ \frac{1}{n}\sum_{i=1}^{n} [Y_i - f(U_i, V_i)]^2 + \lambda_n^2 I^2(f) \Big\},$$

the penalized least squares estimator of function $g$ over the class $\mathcal{F}$. Assume
that the regression of $U$ on $V$ is sufficiently smooth by requiring $I(e) < \infty$.
Given a function $\tau$ from the class $\mathcal{S}$ and a real $\delta$, define
$f_{\tau,\delta}(u,v) = [\theta + \delta]u + [\gamma(v) + \tau(v) - \delta e(v)]$ and note that function $f_{\tau,\delta}$ is a
member of the class $\mathcal{F}$. Introduce criterion functions

$$G_n(\tau,\delta) = \frac{1}{n}\sum_{i=1}^{n} [Y_i - f_{\tau,\delta}(U_i, V_i)]^2 + \lambda_n^2 I^2(f_{\tau,\delta}).$$

Write $\hat{g}_n(u, v)$ as $[\theta + \delta_n]u + [\gamma(v) + \tau_n(v) - \delta_n e(v)]$ and observe that the pair
$(\tau_n, \delta_n)$ minimizes $G_n$ over the class $\{(\tau, \delta) : \tau \in \mathcal{S},\ \delta \in \mathbb{R}\}$.
Methods from penalized least-squares estimation establish the common
rate of convergence for the two components of the estimator $(\tau_n, \delta_n)$. Define
$r = m/(2m + 1)$. Stochastic bound $\|\hat{g}_n - g\|_2 = O_p(n^{-r})$ is derived in Lemma
11.1 of Van de Geer [19]. Note that $\hat{g}_n - g = \delta_n h + \tau_n$, so that
$\|\hat{g}_n - g\|_2^2 = \delta_n^2\|h\|_2^2 + \|\tau_n\|_2^2$ because the conditional expectation
$E(h(U, V) \mid V = v)$ is zero for each $v$ in $[0,1]$. Conclude
that $\delta_n = O_p(n^{-r})$ and $\|\tau_n\|_2 = O_p(n^{-r})$.

Apply the approach of Section 3 to improve the convergence rate of $\delta_n$.
Write $X_n$ for the standardized sum $n^{-1/2}\sum_{i=1}^{n} h(U_i, V_i)W_i$ and deduce that

$$G_n(\tau,\delta) - G_n(\tau,0)
  = \delta^2 Q_n h^2 - 2\delta[n^{-1/2}X_n - Q_n h\tau] + \lambda_n^2[I^2(f_{\tau,\delta}) - I^2(f_{\tau,0})]. \qquad (10)$$

Equality $E\,h(U,V)\tau_n(V) = 0$ implies $Q_n h\tau_n = n^{-1/2}\nu_n h\tau_n$, and asymptotic
equicontinuity of the empirical process indexed by functions $\{h\tau : \tau \in \mathcal{S}\}$
yields $Q_n h\tau_n = o_p(n^{-1/2})$. Use the definition of the roughness to derive
$|I^2(f_{\tau_n,\delta}) - I^2(f_{\tau_n,0})| \le I(\delta e)\,I(2\gamma + 2\tau_n - \delta e)$. Note the stochastic bound
$I(\tau_n) = O_p(1)$, implied by Van de Geer's Lemma 11.1, and conclude that
$\lambda_n^2[I^2(f_{\tau_n,\delta}) - I^2(f_{\tau_n,0})] = o_p(n^{-1/2}\delta)$. Expression (10) evaluated at $\tau = \tau_n$
and $\delta = \delta_n$ simplifies to

$$G_n(\tau_n,\delta_n) - G_n(\tau_n,0) = \delta_n^2 Q_n h^2 - 2\delta_n n^{-1/2}[X_n + o_p(1)].$$

The law of large numbers yields $Q_n h^2 \to \|h\|_2^2$, and the limit is positive by
assumption. Observe that $X_n = O_p(1)$ and apply Theorem 2, with $\delta^2 Q_n h^2$
playing the role of $M_n(\tau, \delta)$, to derive the correct $n^{-1/2}$ convergence rate of
$\delta_n$.

Note that $\delta_n$ minimizes the criterion function $G_n(\tau_n, \delta) - G_n(\tau_n, 0)$ over $\delta$.
Localize this function by writing $\delta = n^{-1/2}t$, and use the results of the
previous paragraph to derive a quadratic approximation that holds uniformly
on compacta,

$$G_n(\tau_n, n^{-1/2}t) - G_n(\tau_n, 0) = n^{-1}[t^2 Q_n h^2 - 2tX_n + o_p(1)]. \qquad (11)$$

Define $\sigma^2(z) = E(W^2 \mid Z = z)$ and note that $X_n \rightsquigarrow X$, where $X \sim N(0, \|\sigma h\|_2^2)$.
Minimization of the random quadratic function in (11) yields
$n^{1/2}\delta_n = X_n/Q_n h^2 + o_p(1)$, and a CLT for $\delta_n$ follows directly. Note that because the
criterion functions $G_n(\tau_n, \cdot)$ are convex, the formal derivation of the bound
$\delta_n = O_p(n^{-1/2})$ could have been sidestepped.
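The last step is the minimization of a one-dimensional quadratic: for a positive leading coefficient, $t^2 Q_n h^2 - 2tX_n$ is minimized at $t = X_n/Q_n h^2$, which gives the limit $N(0, \|\sigma h\|_2^2/\|h\|_2^4)$ for $n^{1/2}\delta_n$. The sketch below only illustrates this algebraic step; the function names and the numerical inputs are assumptions chosen for the example.

```python
def localized_quadratic(t, q_h2, x_n):
    """Leading part of display (11) without the n^{-1} factor: t^2 * Q_n h^2 - 2 t X_n."""
    return t * t * q_h2 - 2.0 * t * x_n

def localized_minimizer(q_h2, x_n):
    """Minimizer over t of the quadratic above; requires a positive leading coefficient."""
    if q_h2 <= 0:
        raise ValueError("q_h2 must be positive for the quadratic to have a minimizer")
    return x_n / q_h2

def limiting_variance(sigma_h_norm_sq, h_norm_sq):
    """Variance of X / ||h||^2 when X ~ N(0, ||sigma h||^2)."""
    return sigma_h_norm_sq / h_norm_sq ** 2

if __name__ == "__main__":
    t_hat = localized_minimizer(q_h2=0.5, x_n=1.2)
    print(t_hat, localized_quadratic(t_hat, 0.5, 1.2))
```

With homoscedastic errors, $\sigma^2(z) \equiv \sigma^2$, the limiting variance reduces to $\sigma^2/\|h\|_2^2$.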

9. Proofs. 

9.1. Proof of Theorem 1. The next result is a version of the continuous 
mapping theorem. 
Lemma 2 (Modified continuous mapping). Consider a metric space $(\mathcal{X}, d)$.
Let random maps $X_n : A_n \to \mathcal{X}$ be defined on some sets $A_n \subseteq \Omega$ and consider
a function $g : \mathcal{X} \to \mathbb{R}^d$ that is continuous at every point of a set $\mathcal{X}_0 \subseteq \mathcal{X}$.
Suppose that $X : \Omega \to \mathcal{X}$ is a Borel measurable map for which there
exists a Borel measurable set $A$, containing each of the sets $A_n$, such that
$X \in \mathcal{X}_0$ on $A$. Suppose that $P^*\{d(X_n, X) > \varepsilon\} \cap A_n \to 0$ for all $\varepsilon > 0$. Then
$P^*\{\|g(X_n) - g(X)\| > \delta\} \cap A_n \to 0$ for all $\delta > 0$.

Proof. Apply a standard device for proving continuous mapping
theorems (see, e.g., the proof of Theorem 1.9.5 in van der Vaart and Wellner
[21]). Fix a positive $\delta$. Let $D_k$ be the set of all $x$ in $\mathcal{X}$ for which there exist
$y$ and $z$ within the open ball of radius $1/k$ around $x$ with $\|g(y) - g(z)\| > \delta$.
Note that $D_k$ is open and the sequence $\{D_k\}$ is decreasing. Also note that
$P\{X \in D_k\} \cap A \downarrow 0$, because every point in $\bigcap_{k=1}^{\infty} D_k$ is a point in the
complement of $\mathcal{X}_0$. Observe that, for every fixed $k$,

$$P^*\{\|g(X_n) - g(X)\| > \delta\} \cap A_n
  \le P\{X \in D_k\} \cap A + P^*\{d(X_n, X) \ge 1/k\} \cap A_n.$$

The first term on the right-hand side can be made arbitrarily small by
choosing $k$ large enough. For a given choice of $k$, the second term tends to
zero as $n$ goes to infinity. □

Dudley [2] proved a representation theorem for the convergence in
distribution in the sense of Hoffmann-Jørgensen. The following argument uses
Dudley's result in the convenient form of Theorem 9.4 in Pollard [11],
referred to as Representation Theorem.

Proof of Theorem 1. Redefine function $H_n$ so that vector $(s_n, t_n)$
becomes the unique minimum. This can be done by leaving function $f_n$
unchanged and decreasing function $g_n$ by an $o_p^*(1)$ amount at exactly the
point $(s_n, t_n)$. Note that the assumptions of the theorem remain valid after
the change.

It is enough to show that $P^*h(s_n, t_n) \to P h(s^*, t^*)$ for all bounded,
uniformly continuous, real functions $h$ on $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$. Invoke the
Representation Theorem for the convergence $(f_n, g_n) \rightsquigarrow (f, g)$, denote the
corresponding perfect maps by $\phi_n$ and write $\omega$ for the elements of the new
probability space. Simplify the notation by replacing the composition $f_n(\phi_n(\omega), s)$
with $f_n(s)$, writing $s_n$ for $s_n(\phi_n(\omega))$, and so on, omitting the $\omega$.
Perfectness of $\phi_n$ implies $|P^*h(s_n, t_n) - P h(s^*, t^*)| \le P^*|h(s_n, t_n) - h(s^*, t^*)|$; hence,
it is enough to show that random vectors $(s_n, t_n)$ converge to $(s^*, t^*)$ in
outer probability (see, e.g., Theorem 1.9.5 in van der Vaart and Wellner).
Write $A_n^r$ for the subset of the new probability space that is defined by
$\{\omega : \|s_n\| \vee \|t_n\| \vee \|s^*\| \vee \|t^*\| < r\}$. Because quantities $\|t_n\|$, $\|s^*\|$ and
$\|t^*\|$ are stochastically bounded with respect to $P^*$, it is sufficient to check
that, for each fixed $r$,

$$P^*\{\|s_n - s^*\| > \delta\} \cap A_n^r \to 0 \quad\text{and}\quad
  P^*\{\|t_n - t^*\| > \delta\} \cap A_n^r \to 0 \quad \text{for all } \delta > 0. \qquad (12)$$

Fix a positive $r$ and restrict all the functions on $\mathbb{R}^{d_1}$ to the ball $\{\|x\| \le r\}$.
Denote by $\mathcal{X}_r$ the set of those (restricted) functions that are bounded and
possess a unique minimum. Write $\Psi_r$ for the argmin map on $\mathcal{X}_r$. On the
set $A_n^r$, the points $s_n$ and $s^*$ can be viewed as two values of the same map:
$s_n = \Psi_r[\alpha_n^{-1}H_n(\cdot, t_n)]$ and $s^* = \Psi_r[f]$. The two corresponding arguments are
close when $n$ is large. Indeed, on the set $A_n^r$,

$$\sup_{\|s\| \le r} |\alpha_n^{-1}H_n(s, t_n) - f(s)|
  \le \sup_{\|s\| \le r} |f_n(s) - f(s)| + (\beta_n/\alpha_n) \sup_{\|s\| \vee \|t\| \le r} |g_n(s,t)|.$$

The right-hand side of the above inequality goes to zero in outer probability
because of the bound from the Representation Theorem and the
boundedness of $g(s,t)$. Let $A^r$ stand for the Borel measurable set $\{\omega : \|s^*\| < r\}$ and
observe that on the set $A^r$ the map $\Psi_r$ is continuous at $f$ [compare with
condition (1) in Section 2]. Apply the modified continuous mapping lemma
with $A_n^r$, $A^r$ and $\mathcal{X}_r$ playing the role of $A_n$, $A$ and $\mathcal{X}$, and deduce the first
convergence in display (12).

Define $\Phi_r$ analogously to $\Psi_r$, but with respect to $\mathbb{R}^{d_2}$. Note that
$t_n = \Phi_r[g_n(s_n, \cdot)]$ and $t^* = \Phi_r[g(s^*, \cdot)]$ on the set $A_n^r$. Also, on this set,

$$\sup_{\|t\| \le r} |g_n(s_n, t) - g(s^*, t)|
  \le \sup_{\|s\| \vee \|t\| \le r} |g_n(s,t) - g(s,t)| + \sup_{\|t\| \le r} |g(s_n, t) - g(s^*, t)|.$$

The first term on the right-hand side tends to zero in outer probability
because of the bound from the Representation Theorem; the second term tends
to zero in outer probability by the standard continuous mapping theorem.
Deduce the second convergence in display (12) using an argument analogous
to the one concluding the previous paragraph. □

9.2. Proofs of the results in Section 3. The following lemma simplifies 
the work with random polynomial functions. 

Lemma 3. Let $\alpha$ be positive and let $\{\gamma_1, \ldots, \gamma_p, \eta_1, \ldots, \eta_p\}$ be a
collection of nonnegative numbers satisfying $\gamma_i < \alpha$ for all $i \in \{1, \ldots, p\}$. Define
$\tau = \min_{i \le p} \eta_i/(\alpha - \gamma_i)$. For each positive $\delta$ and each $O_p^*(1)$ sequence of random
variables $L_n$, there exists an $O_p^*(1)$ sequence of random variables $M_n$, such
that the following upper bound holds for all positive $u$:

$$L_n \sum_{i \le p} n^{-\eta_i} u^{\gamma_i} \le \delta u^{\alpha} + M_n n^{-\alpha\tau}.$$


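Proof. It is enough to establish the bound for $\delta = 1$. Let $M_n(\omega)$ be the
smallest real number satisfying the inequality
$\sup_{u>0}\big(L_n(\omega)\sum_{i \le p} n^{-\eta_i}u^{\gamma_i} - u^{\alpha}\big) \le M_n(\omega)n^{-\alpha\tau}$.
Given a positive $\varepsilon$, select a large enough $L$ to ensure that
$P^*\{L_n > L\} < \varepsilon$. Note that

$$P^*\{M_n > M\} \le P^*\Big\{\sup_{u>0}\Big(L\sum_{i \le p} n^{-\eta_i}u^{\gamma_i} - u^{\alpha}\Big) > Mn^{-\alpha\tau}\Big\} + \varepsilon.$$

To see that the first term on the right-hand side of the above inequality is
zero for all $M$ large enough, combine the upper bound

$$\sup_{u>0}\Big(L\sum_{i \le p} n^{-\eta_i}u^{\gamma_i} - u^{\alpha}\Big) \le \max_{i \le p}\,\sup_{u>0}\big(pLn^{-\eta_i}u^{\gamma_i} - u^{\alpha}\big)$$

with the inequalities

$$\sup_{u>0}\big(pLn^{-\eta_i}u^{\gamma_i} - u^{\alpha}\big) = c_i n^{-\alpha\eta_i/(\alpha - \gamma_i)} \le c_i n^{-\alpha\tau}, \qquad i = 1, \ldots, p.$$

Conclude that $M_n = O_p^*(1)$. □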
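The scaling used in the last display of the proof can be verified directly when $0 < \gamma < \alpha$: the function $u \mapsto c\,n^{-\eta}u^{\gamma} - u^{\alpha}$ is maximized at $u^* = (c\,n^{-\eta}\gamma/\alpha)^{1/(\alpha - \gamma)}$, and the maximum is proportional to $n^{-\alpha\eta/(\alpha - \gamma)}$. The sketch below is a numerical sanity check under that assumption; the function names and the test values are illustrative and do not come from the paper.

```python
import numpy as np

def sup_value(c, n, eta, gamma, alpha):
    """Closed-form sup over u > 0 of c * n**(-eta) * u**gamma - u**alpha, for 0 < gamma < alpha."""
    k = c * n ** (-eta)
    # First-order condition: k * gamma * u**(gamma - 1) = alpha * u**(alpha - 1).
    u_star = (k * gamma / alpha) ** (1.0 / (alpha - gamma))
    return k * u_star ** gamma - u_star ** alpha

def sup_on_grid(c, n, eta, gamma, alpha, u_max=1.0, num=100_001):
    """Brute-force approximation of the same sup on an equally spaced grid in [0, u_max]."""
    u = np.linspace(0.0, u_max, num)
    return float(np.max(c * n ** (-eta) * u ** gamma - u ** alpha))

def rate_exponent(eta, gamma, alpha):
    """Exponent r such that the sup is proportional to n**(-r)."""
    return alpha * eta / (alpha - gamma)

if __name__ == "__main__":
    print(sup_value(2.0, 50, 1.0, 2.0, 3.0), sup_on_grid(2.0, 50, 1.0, 2.0, 3.0, u_max=0.1))
    print(rate_exponent(1.0, 2.0, 3.0))
```

For $\gamma = 0$ the supremum is approached as $u \to 0$ and equals $c\,n^{-\eta}$, which is consistent with the same exponent $\alpha\eta/(\alpha - \gamma) = \eta$.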

Proof of Theorem 2 is omitted because it is similar to the following argu- 
ment. 

Proof of Lemma 1. Deduce $\|a_n\|^{\alpha} + \|b_n\|^{\beta} = O_p^*\big(\sum_{i \le p} n^{-\eta_i}\|(a_n, b_n)\|^{\gamma_i}\big)$
from inequality $G_n(a_n, b_n) - G_n(0, 0) \le 0$. For each positive $\delta$, use Lemma
3 to establish

$$\|a_n\|^{\alpha} + \|b_n\|^{\beta} \le \delta\|(a_n, b_n)\|^{\alpha} + O_p^*(n^{-\alpha\tau_a}).$$

Take a small enough $\delta$ and use inequality $\alpha > \beta$ to derive
$\|a_n\|^{\alpha} + \|b_n\|^{\beta} = O_p^*(n^{-\alpha\tau_a})$. Conclude that $\|a_n\| = O_p^*(n^{-\tau_a})$ and
$\|b_n\| = O_p^*(n^{-\alpha\tau_a/\beta})$. □

9.3. Proof of Theorem 3. According to condition (iv),

$$G_n(a, b) = G(a, b) + n^{-1/2}a'\nu_n\Delta_1 + n^{-1/2}b'\nu_n\Delta_2 + n^{-1/2}\|(a, b)\|\nu_n r_{a,b}.$$

Combine this representation with the stochastic bound $\|(a_n, b_n)\| = o_p(1)$
and deduce the approximation $G_n(a_n, b_n) = G(a_n, b_n) + O_p(n^{-1/2}\|(a_n, b_n)\|)$.
Apply Lemma 1 with function $G$ playing the role of $M_n$ and derive the
stochastic bounds $\|a_n\| = O_p(n^{-\tau_a})$ and $\|b_n\| = O_p(n^{-\alpha\tau_a/\beta})$. It follows from
condition (v) that

$$G_n(a,b) - G_n(a,0) = G(a, b) - G(a, 0) + n^{-1/2}b'\nu_n\Delta_2
  + n^{-1/2}\|b\|\nu_n s_{a,b} + n^{-1/2}\nu_n l_{a,b}.$$

Conditions (vi) and (vii) yield $\nu_n s_{a_n,b_n} = o_p(1)$ and $\nu_n l_{a_n,b_n} = o_p(n^{-\beta\tau_b + 1/2})$.
Thus,

$$G_n(a_n, b_n) - G_n(a_n, 0) = G(a_n, b_n) - G(a_n, 0) + O_p(n^{-1/2}\|b_n\|) + o_p(n^{-\beta\tau_b}). \qquad (13)$$
Observe that J2?=i l^nH" -2 !!^!!* = O p (n -1 / 2 ||6 n ||). Consequently, 
G(a n , b n 

= MK)[l + o p (l)]+O p (^n^\\b n \\^ + 0p (n~ l l 2 \\b n \\). 

Combine this approximation with approximation (13) and let the term
$\psi_2(b_n)[1 + o_p(1)]$ play the role of $M_n(a_n, b_n)$ in Theorem 2. Conclude that
$\|b_n\| = O_p(n^{-\tau_b})$.

Introduce new variables $s$ and $t$ by $s = n^{\tau_a}a$ and $t = n^{\tau_b}b$. Observe that
v v 



and 



i=l i=l 



J2 \\n~ Ta s\\ a ~ i \\n- Tb t\\ i < n -M^-i)-r b ||a|| 0-< | 

i=l i=l 



-Pnn-iP-Wo-n] 

i=l 



,, '^2\\s\\ a % \\t\\\ 



Combine the last two displays with the approximations in conditions (iv)
and (vii) to deduce

$$G(n^{-\tau_a}s, n^{-\tau_b}t) = n^{-\alpha\tau_a}[\psi_1(s) + q_n(s)]
  + n^{-\beta\tau_b}\Big[\psi_2(t) + \sum_{i=1}^{p}\phi_i(s,t)1\{\lambda_i = \tau_b\} + w_n(s,t)\Big], \qquad (14)$$

where $\sup_{K_1}|q_n(s)| = o(1)$ for every compact set $K_1$ in $\mathbb{R}^{d_1}$, and
$\sup_{K_2}|w_n(s,t)| = o(1)$ for every compact set $K_2$ in $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$. Conditions (iv) through
(vii) yield

$$G_n(n^{-\tau_a}s, n^{-\tau_b}t) = G(n^{-\tau_a}s, n^{-\tau_b}t) + n^{-\alpha\tau_a}[s'\nu_n\Delta_1 + q_n'(s)]
  + n^{-\beta\tau_b}[n^{-(\beta - 1)(\lambda_0 - \tau_b)}t'\nu_n\Delta_2 + w_n'(s,t)], \qquad (15)$$

where $\sup_{K_1}|q_n'(s)| = o_p(1)$ for every compact set $K_1$ in $\mathbb{R}^{d_1}$, and
$\sup_{K_2}|w_n'(s,t)| = o_p(1)$ for every compact set $K_2$ in $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$.

Denote $G_n(n^{-\tau_a}s, n^{-\tau_b}t)$ by $H_n(s,t)$. Combine approximations (14) and
(15) and conclude that, uniformly on compacta in $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$,

$$H_n(s,t) = n^{-\alpha\tau_a}[\psi_1(s) + s'\nu_n\Delta_1 + o_p(1)]
  + n^{-\beta\tau_b}\Big[\psi_2(t) + 1\{\lambda_0 = \tau_b\}t'\nu_n\Delta_2 + \sum_{i=1}^{p}1\{\lambda_i = \tau_b\}\phi_i(s,t) + o_p(1)\Big].$$

Note that $\alpha\tau_a \le \beta\tau_b$. Indeed, this inequality is valid in the case $\lambda_0 = \tau_b$;
in the case $\lambda_0 \ne \tau_b$, it follows from approximation (14) and the fact that
function $G$ assumes only nonnegative values near the origin. If $\alpha\tau_a < \beta\tau_b$,
apply Theorem 1 to complete the proof. If $\alpha\tau_a = \beta\tau_b$, the standard argmin
theorem will suffice.